Bellabeat Case Study
Background
The Bellabeat Case Study involves an imagined company, the eponymous Bellabeat, a manufacturer of health-focused technology products marketed particularly to women. The company has carved out its own share of the market thanks to its product quality and niche focus, but Bellabeat’s leadership believes the company has the potential to become a truly heavy hitter in the global realm of wearable and health-focused tech.
The analyst I role-play as is tasked by Bellabeat’s Chief Creative Officer with analyzing smart device usage data to garner insight about consumers of Bellabeat’s competitors, the other players in the arena of wearable health tech, in order to make data-driven decisions about improving and marketing Bellabeat’s products.
Business Task
Objective 1: We will first collect, verify, and prepare any data we can find about consumers of wearable health technology. Ordinarily, this would be a journey in and of itself, but the case study points to a useful collection of CSV files on Kaggle with data collected from 33 Fitbit users over the course of one month. This data is expounded on below, under The Data header.
Objective 2: Next, we will inspect the data we’ve collected to glean any helpful insights, trends, or implications. These will inform the decisions we advise the leadership to undertake.
Objective 3: Finally, we are to use the conclusions we have drawn from the information and, focusing on a particular Bellabeat product, provide recommendations on how to use these conclusions to guide decisions made in Bellabeat’s marketing strategy.
The Data
About
The original data, posted by Kaggle user MÖBIUS, can be found here:
The information itself was generated by respondents to a survey distributed via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016. Thirty-three eligible Fitbit users consented to the submission of personal tracker data, including up to minute-level output for physical activity, heart rate, and sleep monitoring. This information was then fitted to the features present in this data set, including classifications such as activity intensity, measurements such as calories burned, and minute-level observations of variables such as sleep state and heart rate.
Drawbacks
The data is marvelously detailed and dense, but the sample size and duration of record might be too small for confident data-driven decision making. In addition, Bellabeat markets directly to women, and this data does not record gender, since it is anonymized.
However, what this information lacks in sample size and target demographic, it makes up for in sheer detail. These types of data sets that describe a snapshot of target behavior can be helpful in discerning outstanding or significant trends to inspect further using more widespread and comprehensive data if such can be found.
Some Remedies
Luckily, a short Kaggle search reveals another data set, posted by Kaggle user Akash Kumar, that seems to include similar data on a number of Fitbit users from March 11, 2016 to April 11, 2016, which I noted was suspiciously adjacent to the study the original data set was built from (which took place from April 12 to May 12 of the same year). Upon further inspection (expanded on under the Prepare header), I found that all the users were the same, and this is essentially a precursor data set to the one from April to May. This gives us some more time context to work with, which will further reinforce our analysis.
Ask
Based on the data and the assumption that as a health tech company, Bellabeat strives to market itself as health-positive, intelligent, and convenient, we will start with the following questions.
- What are the general trends of activity, both physical and interactive, exhibited by wearable health tech consumers?
- How do consumers measure up to recommended health standards, and each other?
- Are there any features that suggest insightful relationships, and are these indicative of something Bellabeat can exploit in its marketing strategy?
Prepare
We will start by looking at the different tables in the data set. I used Google Sheets and this source to supplement my observations and find some inspiration for what direction to take this analysis. These tables are ranked in order of scope: daily activity, at the top, measures a wide range of variables over days. As the list descends, variable scope narrows and measurement grows more precise, from hours to minutes to seconds.
The Tables
1) Daily Activity (940 obs. of 15 variables)
- features :
- Identification : ID, date (per day)
- Aggregates : total steps, distance, tracker distance, calories, logged activity distance
- Measures : distance (Very/Moderate/Lightly/Sedentary), minutes (Very/Fairly/Lightly/Sedentary)
2) Daily Sleep (413 obs. of 5 variables)
- features :
- Identification : ID, date of sleep (midnight every day)
- Aggregates : total sleep records
- Measures : total time in bed, total minutes asleep
3) Weight Log (67 obs. of 8 variables)
- features :
- Identification : ID, log ID, date (datetime)
- Aggregates : isManualReport (was the weight log done manually?)
- Measures : weight(kg and lb), total fat, BMI
4) Daily Calories, Daily Steps
- standalone tables for the calories and total steps features from Daily Activity, redundant
5) Hourly and Minutely Calories
- Calories burned by user ID and corresponding time frame
- I wouldn’t recommend trying to open it in a spreadsheet
- Minute Calories
- Narrow lists by minute
- Wide lists by hour, makes each calorie per minute a feature
6) Hourly Intensity (22099 obs. of 4 variables) and Minute intensity (even bigger)
- features :
- Identification : ID, date (datetime, stratified per hour)
- Measures : total intensity, average intensity
- minute intensity does what minute calories does but for intensity measure
7) Minute sleep (188521 obs of 4 variables)
- features :
- Identification : ID, date, logID
- Measures : value (indicates sleep state -> 1 = asleep, 2 = restless, 3 = awake)
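As an aside, the wide minute-level layout described above can be reshaped into the narrow one with tidyr. Here is a minimal sketch on made-up data; the column names (Calories00 through Calories02) are illustrative assumptions, not confirmed names from the file:

```r
library(tidyverse)

## Hypothetical wide layout: one row per hour, one column per minute.
## (Column names here are assumptions for illustration.)
wide_cals <- tibble(
  Id = 1503960366,
  ActivityHour = "4/13/2016 12:00:00 AM",
  Calories00 = 1.2, Calories01 = 1.1, Calories02 = 1.3
)

## pivot_longer() collapses the minute columns into one row per minute,
## matching the narrow file's shape.
narrow_cals <- wide_cals %>%
  pivot_longer(starts_with("Calories"),
               names_to = "minute", names_prefix = "Calories",
               values_to = "calories")
```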
Preparing Daily Sleep, Daily Activity, and Weight Log
Since our questions are comprehensive, it might prove beneficial to start with just the first few data sets, and work our way down as needed, for peace of mind and memory. This will also help us stay organized in our search - I’ve found that it can be frustratingly easy to forget which table you’re working with and accidentally change something you shouldn’t have. This is only slightly harder than overlooking said issue until it ruins your calculations and you need to take a 10-minute walk before seeing any code again.
Let’s begin with our daily activity, weight log, and daily sleep tables. We will highlight the preparatory process for these three in this section, but the prepare task is one which should be completed for each data set used for analysis. We will encounter that too, but on an as-needed basis so as not to clutter our code.
I’ll be using R, with the help of readr and tidyverse, to get a quick look at our information.
Load imports and tables
## Imports
library(readr)
library(tidyverse)
## Read in data
daily_act <- read_csv('fitbit/dailyActivity_merged.csv')
daily_sleep <- read_csv('fitbit/sleepDay_merged.csv')
weight_log <- read_csv('fitbit/weightLogInfo_merged.csv')
Let’s start by taking a peek at each of the tables.
# daily activity
daily_act %>% head(3) %>% knitr::kable()
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
| 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
| 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
# daily sleep
daily_sleep %>% head(3) %>% knitr::kable()
| Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
|---|---|---|---|---|
| 1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
| 1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
| 1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
# weight log
weight_log %>% head(3) %>% knitr::kable()
| Id | Date | WeightKg | WeightPounds | Fat | BMI | IsManualReport | LogId |
|---|---|---|---|---|---|---|---|
| 1503960366 | 5/2/2016 23:59 | 52.6 | 115.9631 | 22 | 22.65 | TRUE | 1.46223e+12 |
| 1503960366 | 5/3/2016 23:59 | 52.6 | 115.9631 | NA | 22.65 | TRUE | 1.46232e+12 |
| 1927972279 | 4/13/2016 1:08 | 133.5 | 294.3171 | NA | 47.54 | FALSE | 1.46051e+12 |
Inspection and verification
User inclusion
Let’s start by checking the number of unique users on each of the tables.
paste("Users in daily activity : ",n_distinct(daily_act$Id))
## [1] "Users in daily activity : 33"
paste("Users in daily sleep : ",n_distinct(daily_sleep$Id))
## [1] "Users in daily sleep : 24"
paste("Users in weight log : ",n_distinct(weight_log$Id))
## [1] "Users in weight log : 8"
Daily activity boasts the full set of participants at 33 users, with daily sleep at 24, but weight log’s low user count undermines its validity. At only 8 users, the data collected has the potential to be significantly biased or highly variable.
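One quick way to see just how lopsided a log like this is would be to count entries per user. A sketch on toy data; on the real table, the equivalent call would be weight_log %>% count(Id, sort = TRUE):

```r
library(tidyverse)

## Counting records per user exposes imbalance at a glance.
## (Toy IDs here; the real call would use weight_log's Id column.)
toy_log <- tibble(Id = c(1, 1, 1, 2, 3, 3))
per_user <- toy_log %>% count(Id, sort = TRUE)
per_user  # user 1 contributes half of all records
```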
We’ll have to analyze activity and sleep side by side at some point, so losing information from 9 users is a sacrifice we’ll have to make. The best we can do is ensure the user constitution of daily sleep has as much overlap with daily activity as possible.
## checking common values between user ID in daily sleep and daily act
act_id <- unique(daily_act$Id) # daily act unique ID's
sleep_id <- unique(daily_sleep$Id) # daily sleep unique ID's
paste("Number of common values: ",length(intersect(act_id, sleep_id)))
## [1] "Number of common values: 24"
Best case scenario!
Nulls and duplicates
The next step is customary housekeeping - we will check if either of the two has nulls or duplicate entries (we’ll leave weight log for now). If any data sets end up having nulls or duplicates, we will manage them.
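If a table did turn up nulls or duplicates, a minimal cleanup sketch with dplyr and tidyr might look like this (toy data; column names are illustrative):

```r
library(tidyverse)

## distinct() drops exact duplicate rows; drop_na() drops rows with
## missing values (optionally scoped to specific columns).
messy <- tibble(id = c(1, 1, 2), steps = c(100, 100, NA))
cleaned <- messy %>%
  distinct() %>%     # removes the repeated (1, 100) row
  drop_na(steps)     # removes the row with a missing step count
```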
Daily Activity
## Daily Activity ###############################################################
paste0("Total Nulls : ",sum(is.na(daily_act)))
## [1] "Total Nulls : 0"
paste0("Total Duplicates : ",sum(duplicated(daily_act)))
## [1] "Total Duplicates : 0"
Daily Sleep
## Daily Sleep ##################################################################
paste0("Total Nulls : ",sum(is.na(daily_sleep)))
## [1] "Total Nulls : 0"
paste0("Total Duplicates : ",sum(duplicated(daily_sleep)))
## [1] "Total Duplicates : 0"
Well, then.
Second Fitbit Data Set
The other data set we have yet to look at is the supplementary CSV file from Akash Kumar that completes the two-month arc from March 11 to May 12, 2016.
second_fitbit <- read_csv('second_fitbit_data.csv')
second_fitbit %>% head(3) %>% knitr::kable() # glimpse of data
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 3/25/2016 | 11004 | 7.11 | 7.11 | 0 | 2.57 | 0.46 | 4.07 | 0 | 33 | 12 | 205 | 804 | 1819 |
| 1503960366 | 3/26/2016 | 17609 | 11.55 | 11.55 | 0 | 6.92 | 0.73 | 3.91 | 0 | 89 | 17 | 274 | 588 | 2154 |
| 1503960366 | 3/27/2016 | 12736 | 8.53 | 8.53 | 0 | 4.66 | 0.16 | 3.71 | 0 | 56 | 5 | 268 | 605 | 1944 |
paste("nulls : ",sum(is.na(second_fitbit)))
## [1] "nulls : 0"
paste("duplicates : ",sum(duplicated(second_fitbit)))
## [1] "duplicates : 0"
Now, for the moment of truth…
## Checking if this data set's users overlap with those of daily act and daily sleep
secfit_id <- unique(second_fitbit$Id)
paste("number of ID's in second_fitbit: ", length(secfit_id))
## [1] "number of ID's in second_fitbit: 35"
paste("Number of shared values: ", length(intersect(act_id, secfit_id))) ## used act_id from before
## [1] "Number of shared values: 33"
They have as much overlap as possible! We’ll be able to merge this with our daily activity table, but not daily sleep. This data set will be a useful tool for when we want to inspect activity individually over a longer scale of time.
Process
Individual Inspection and Cleaning
Like the Prepare step, this step should be performed every time a data set is loaded up. Processing is an umbrella term for checking and cleaning information in a table. We already know every cell is filled and there are no duplicates, so the content is sufficiently prepped for analysis. Now comes the time to transform the data into something more easily understood by R.
This will consist of a few procedures:
- Transform the date and datetime columns into something more mutable using lubridate
- Get rid of unneeded columns
- Standardize column names
Luckily, we can do all of this in one process using the %>% operator and help from dplyr and janitor.
library(lubridate)
library(janitor) ## for clean_names() -> makes all spaces underscores, caps lowercase
### Daily Activity ******************************
daily_act <- daily_act %>%
mutate(date = mdy(ActivityDate)) %>%
select(-ActivityDate) %>%
clean_names() ## create a datetime object called date, delete original date column, clean names
daily_act %>% head(3) %>% knitr::kable()
| id | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 | 2016-04-12 |
| 1503960366 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 | 2016-04-13 |
| 1503960366 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 | 2016-04-14 |
### Daily Sleep *********************************
daily_sleep <- daily_sleep %>%
mutate(date = mdy_hms(SleepDay)) %>%
select(-SleepDay) %>%
clean_names()
daily_sleep %>% head(3) %>% knitr::kable()
| id | total_sleep_records | total_minutes_asleep | total_time_in_bed | date |
|---|---|---|---|---|
| 1503960366 | 1 | 327 | 346 | 2016-04-12 |
| 1503960366 | 2 | 384 | 407 | 2016-04-13 |
| 1503960366 | 1 | 412 | 442 | 2016-04-15 |
### Second Fitbit *******************************
second_fitbit <- second_fitbit %>%
mutate(date = mdy(ActivityDate)) %>%
select(-ActivityDate) %>%
clean_names()
second_fitbit %>% head(3) %>% knitr::kable()
| id | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 11004 | 7.11 | 7.11 | 0 | 2.57 | 0.46 | 4.07 | 0 | 33 | 12 | 205 | 804 | 1819 | 2016-03-25 |
| 1503960366 | 17609 | 11.55 | 11.55 | 0 | 6.92 | 0.73 | 3.91 | 0 | 89 | 17 | 274 | 588 | 2154 | 2016-03-26 |
| 1503960366 | 12736 | 8.53 | 8.53 | 0 | 4.66 | 0.16 | 3.71 | 0 | 56 | 5 | 268 | 605 | 1944 | 2016-03-27 |
Merge
We’ll create two new data sets for further inspection. The first one will be daily act and daily sleep merged, and the other will be a concatenation of both activity data sets.
Speaking of which…
## I should probably rename this to something more descriptive
mar_to_apr_act <- second_fitbit
act_sleep
We’ll merge using an inner join on the basis of ID and date, and store the result in a table called act_sleep. It’ll be good to remember going forward that we’ve lost 9 users in our overall calculations for this data set.
We’ll also reorder the table so that the date shows up on the left side with ID.
act_sleep <- inner_join(daily_act, daily_sleep, by=c("id","date"))
paste0("Total number of ID's : ",n_distinct(act_sleep$id))
## [1] "Total number of ID's : 24"
act_sleep %>% head(3) %>% knitr::kable()
| id | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | date | total_sleep_records | total_minutes_asleep | total_time_in_bed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 | 2016-04-12 | 1 | 327 | 346 |
| 1503960366 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 | 2016-04-13 | 2 | 384 | 407 |
| 1503960366 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 | 2016-04-15 | 1 | 412 | 442 |
paste0("Where is the date column currently located: at column ",grep('date', colnames(act_sleep)))
## [1] "Where is the date column currently located: at column 15"
## reordering
act_sleep <- act_sleep[,c(1,15,3,4,5,6,7,8,9,10,11,12,13,14,2,16,17,18)]
act_sleep %>% head(3) %>% knitr::kable()
| id | date | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | total_steps | total_sleep_records | total_minutes_asleep | total_time_in_bed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 2016-04-12 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 | 13162 | 1 | 327 | 346 |
| 1503960366 | 2016-04-13 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 | 10735 | 2 | 384 | 407 |
| 1503960366 | 2016-04-15 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 | 9762 | 1 | 412 | 442 |
act_extended
For our activity tables, we’ll really just have to concatenate them using rbind(). We’ll put the data set spanning March to April on the left so the other gets attached to its tail.
act_extended <- rbind(mar_to_apr_act, daily_act)
act_extended %>% head(3) %>% knitr::kable()
| id | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 11004 | 7.11 | 7.11 | 0 | 2.57 | 0.46 | 4.07 | 0 | 33 | 12 | 205 | 804 | 1819 | 2016-03-25 |
| 1503960366 | 17609 | 11.55 | 11.55 | 0 | 6.92 | 0.73 | 3.91 | 0 | 89 | 17 | 274 | 588 | 2154 | 2016-03-26 |
| 1503960366 | 12736 | 8.53 | 8.53 | 0 | 4.66 | 0.16 | 3.71 | 0 | 56 | 5 | 268 | 605 | 1944 | 2016-03-27 |
Let’s take a quick peek at the data types involved. Since act_sleep has all the data types both tables use, we’ll just query a summary for that one.
## Data types present
sapply(act_sleep, class)
## $id
## [1] "numeric"
## $date
## [1] "POSIXct" "POSIXt"
## $total_distance
## [1] "numeric"
## $tracker_distance
## [1] "numeric"
## $logged_activities_distance
## [1] "numeric"
## $very_active_distance
## [1] "numeric"
## $moderately_active_distance
## [1] "numeric"
## $light_active_distance
## [1] "numeric"
## $sedentary_active_distance
## [1] "numeric"
## $very_active_minutes
## [1] "numeric"
## $fairly_active_minutes
## [1] "numeric"
## $lightly_active_minutes
## [1] "numeric"
## $sedentary_minutes
## [1] "numeric"
## $calories
## [1] "numeric"
## $total_steps
## [1] "numeric"
## $total_sleep_records
## [1] "numeric"
## $total_minutes_asleep
## [1] "numeric"
## $total_time_in_bed
## [1] "numeric"
The POSIXct data type you see represents a datetime object type that lubridate works with. These offer massive flexibility, as we’ll see in a bit. It seems the rest are all numeric.
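To illustrate that flexibility, here is a small sketch of the kind of component extraction lubridate makes trivial on POSIXct values, the sort of thing we lean on later for the weekday breakdowns:

```r
library(lubridate)

## Parse a datetime in the same format our sleep table uses,
## then pull out components directly.
d <- mdy_hms("4/12/2016 12:00:00 AM")
wday(d, label = TRUE)    # day of the week as an ordered factor
month(d, label = TRUE)   # month name
floor_date(d, "week")    # snap down to the start of the week
```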
With that, we’re ready to analyze!
Analyze
Revisiting our big picture questions
As we approach our strategy for analysis, it would be beneficial to revisit the big picture questions we laid down in the Ask phase. With a little more context from our data sets, we can break these holistic questions down into more targeted queries.
- What are the general trends of activity, both physical and interactive, exhibited by wearable health tech consumers?
- What is the distribution of frequency of Fitbit use? Total number of days logged per user?
- Are there certain times of the week or the day that users are most active?
- What is the general difference between time spent in bed and time spent asleep?
- How do consumers measure up to recommended health standards, and each other?
- What is the recommended amount of sleep for adults, and are users meeting it?
- What is the recommended amount of steps taken for adults, and are users meeting it?
- Is there any distinct stratification between users on the bases of one of the variables? Are people spread out in the distribution, or are there clear levels?
- Are there any features that suggest insightful relationships, and are these indicative of something Bellabeat can exploit in its marketing strategy?
- If lacking in sufficient sleep, can we correlate the gap to another feature? (i.e. total steps)
- How strong of a trend is there between calories burned, total intensity, etc.?
General Activity Trends per User
We’ll begin by getting to know our users a little.
Let’s start by seeing if we can find a way to quantify their time spent using the Fitbit. We’ll do this by sorting act_sleep users by total number of days used, and storing the result in days_used.
Next, to create a categorical variable, we’ll sort users based on the total number of days of activity, breaking our (in this case) 31-day stretch into four pieces.
- casual : 0 - 6 days
- moderate : 7 - 15 days
- frequent : 16 - 22 days
- active : more than 22 days
## Create a new DF with new categorical variable
days_used <- act_sleep %>%
group_by(id) %>%
summarize(days_used = n()) %>%
mutate(user_type = case_when(
days_used < 7 ~ "casual",
days_used < 16 ~ "moderate",
days_used < 23 ~ "frequent",
days_used < 33 ~ "active"
))
## Take a look
days_used %>%
group_by(user_type) %>%
summarize(amount = n()) %>%
mutate(percent_of_whole = (amount/24)*100) %>%
knitr::kable()
| user_type | amount | percent_of_whole |
|---|---|---|
| active | 12 | 50.000000 |
| casual | 8 | 33.333333 |
| frequent | 1 | 4.166667 |
| moderate | 3 | 12.500000 |
We’ll then manipulate these data sets to gain some insight into our variables.
Proportional Spread of user activity
Count
library(ggplot2)
library(plotly)
library(ggrepel)
library(ggthemes)
## Pie chart - count
days_used_count <- days_used %>%
group_by(user_type) %>%
summarize(amount = n()) ## created a days_used summary column
for_count_pie <- days_used_count %>%
mutate(csum = rev(cumsum(rev(amount))),
pos = amount/2 + lead(csum, 1),
pos = if_else(is.na(pos), amount/2, pos)) ## aux df for fancy labels on the pie chart
ggplot(days_used_count, aes(x="", y=amount, fill=user_type)) +
geom_col(width = 1, color = 1) +
coord_polar(theta = "y", clip="on") +
scale_fill_brewer(palette = "Set1") +
labs(title="Users by frequency of use - Count (out of 24)") +
labs(x=" ") +
geom_label_repel(data=for_count_pie, aes(y=pos, label=paste0(amount)),
size=4.5, nudge_x=1, show.legend=FALSE,alpha=0.85) +
guides(fill=guide_legend(title = "User Type")) +
theme_excel_new() ## generate chart
Percentage
## Pie chart - percent
days_used_percent <- days_used %>%
group_by(user_type) %>%
summarize(amount = n()) %>%
mutate(percentage_raw = amount/24) %>%
mutate(percent_of_whole = scales::percent(percentage_raw)) ## percentage summary df
for_pie2 <- days_used_percent %>%
mutate(csum = rev(cumsum(rev(amount))),
pos = amount/2 + lead(csum, 1),
pos = if_else(is.na(pos), amount/2, pos)) ## aux df to help with ggrepel
ggplot(days_used_percent, aes(x="", y=amount, fill=user_type)) +
geom_col(width = 1, color = 1) +
coord_polar(theta = "y", clip="on") +
scale_fill_brewer(palette = "Set1") +
labs(title="Users by frequency of use") +
labs(x=" ") +
geom_label_repel(data=for_pie2, aes(y=pos, label=paste0(percent_of_whole)),
size=4.5, nudge_x=1, show.legend=FALSE,alpha=0.85) +
guides(fill=guide_legend(title = "User Type")) +
theme_excel_new() ## Generate chart
Exactly half of the users use the watch most frequently, and a third use it least frequently, according to our levels. There is very little in between, suggesting a higher-than-average engagement rate. From this point on, I chose to group all the users who weren’t explicitly “active” (whether they accessed the technology frequently, moderately, or casually) into one group, since each was a small enough proportion, and relatively similar in character.
Popular Weekdays by User Type
### join
extended_with_userType <- left_join(act_sleep, days_used, by="id")
### renaming user_type for clarity
extended_with_userType <- extended_with_userType %>% rename(frequency_of_use = user_type)
### Making active and non-active datasets
active_users <- extended_with_userType %>% filter(frequency_of_use=='active')
nonActive_users <- extended_with_userType %>% filter(frequency_of_use!='active')
### For later on
weekday_list <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
Active Users
active_use <- active_users %>% mutate(weekday = weekdays(date)) %>% group_by(weekday) %>% summarize(count = n())
v_act <- ggplot(active_use, aes(x=factor(weekday, level=weekday_list), y=count)) +
geom_bar(stat='identity', color='blue', fill='skyblue', alpha=0.5) +
scale_x_discrete(labels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
ggplotly(v_act)
Active users tend to log more time during the week. The majority of days of use are Monday-Thursday.
Non-active Users
nonActive_use <- nonActive_users %>% mutate(weekday = weekdays(date)) %>% group_by(weekday) %>% summarize(count = n())
v_act <- ggplot(nonActive_use, aes(x=factor(weekday, level=weekday_list), y=count)) +
geom_bar(stat='identity', color='blue', fill='skyblue', alpha=0.5) +
scale_x_discrete(labels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
ggplotly(v_act)
Non-active users are most likely to put time in during the weekend. Maybe non-active users can be encouraged to make the watch more of a weekly habit than a weekend thing.
Productive Weekdays by User Type
Active Users
active_cals <- active_users %>% mutate(weekday = weekdays(date)) %>% group_by(weekday) %>% summarize(mean_cals = mean(calories))
v_act <- ggplot(active_cals, aes(x=factor(weekday, level=weekday_list), y=mean_cals)) +
geom_bar(stat='identity', color='blue', fill='skyblue', alpha=0.5)
ggplotly(v_act)
There is a dip in mean calories burned mid-to-late week. There might be reward in programming an app feature that gives users a pick-me-up for that time period.
Non-active Users
nonActive_cals <- nonActive_users %>% mutate(weekday = weekdays(date)) %>% group_by(weekday) %>% summarize(mean_cals = mean(calories))
v_act <- ggplot(nonActive_cals, aes(x=factor(weekday, level=weekday_list), y=mean_cals)) +
geom_bar(stat='identity', color='blue', fill='skyblue', alpha=0.5) +
scale_x_discrete(labels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
ggplotly(v_act)
Non-active users are less consistent overall than active users.
All About Sleep
Is there a relationship?
### ggplot object
extended_with_userType <- extended_with_userType %>% mutate(diff=total_time_in_bed-total_minutes_asleep)
p <- ggplot(extended_with_userType, aes(x=diff,
y=calories,
color=frequency_of_use)) +
geom_jitter() +
geom_smooth()
### plotly
ggplotly(p)
Not much of a relationship. Maybe a slight inverse relationship, but the majority of the data is concentrated within 0-100 minutes anyway.
Sleep by Weekday
diff_weekdays <- extended_with_userType %>% mutate(weekday = weekdays(date)) %>% group_by(weekday) %>% summarize(mean_sleep = mean(diff))
d_act <- ggplot(diff_weekdays, aes(x=factor(weekday, level=weekday_list), y=mean_sleep)) +
geom_bar(stat='identity', color='blue', fill='skyblue', alpha=0.5) +
scale_x_discrete(labels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
ggplotly(d_act)
People struggle to sleep the most on Sunday. Sunday meditation?
Total Minutes Asleep vs Total Steps
### ggplot object
p <- ggplot(extended_with_userType, aes(x=total_steps,
y=total_minutes_asleep,
color=frequency_of_use)) +
geom_jitter()
### plotly
ggplotly(p)
Not much of a relationship, if any.
The Recommended Values Scoreboard
Recommended Steps
### ggplot object
p1 <- ggplot(extended_with_userType, aes(x=total_steps, fill=frequency_of_use)) +
geom_density(alpha=0.75) +
geom_vline(xintercept = 10000)
### plotly
ggplotly(p1)
Active users tend to reach their daily step goal more often.
Recommended Sleep
### ggplot object
p2 <- ggplot(extended_with_userType, aes(x=total_minutes_asleep, fill=frequency_of_use)) +
geom_density(alpha=0.75) +
geom_vline(xintercept = 420)
### plotly
ggplotly(p2)
Active users tend to reach their daily sleep goal more often.
Conclusions
Final Conclusions and Business Ideas
Wearable health technology seems to either stick completely or never catch on with someone. The question of reliable use is binary: the majority of users either make it part of their routine or lose momentum with the device quickly. Those that do use it more frequently tend to burn more calories, and are more likely to get the recommended amount of steps and sleep in per day.
This is compelling evidence that marketing can use when advertising wearable health technology: wearing it increases the chances that the wearer will see a healthy lifestyle change if they stick with it.
Fitbit use is most prevalent on the weekends for casual users, with peaks and valleys in between, whereas for active users, daily Fitbit use is much more sustained throughout the week, with Monday-Thursday being a sustained peak.
There seems to be a dip in productivity near the middle of the week for active users. In addition, active users spend the most time in bed awake on Sunday. Maybe the app can feature a mid-week stress-relief meditation track, along with a restless-night meditation track, or even partner with a health meditation app like Headspace to do so. This could have a positive effect on users’ productivity and total minutes asleep when compared to total minutes in bed.
Afterthoughts
Though this data is insightful, to find more sustained evidence I’d want to run significance tests on it. The sample size was pretty small, and it only accounted for about two months. I had a lot of trouble looking at minute-to-minute intensities, and I don’t think I’m yet quite skilled enough to glean substantial information from them that isn’t redundant with insights from other tables. The business objective asks for macro-level approaches to marketing and feature inclusion, so I felt it wasn’t all that important.
If I had superpowers, I’d like to chart minute-to-minute intensities of active users vs non-active users and cross reference them with mean total steps or calories burned to understand the effect step count has on intensity.
Another thing I’d have liked to do would be to use bootstrapping techniques to resample each group to get a more normalized mean distribution. That way I can be more sure whether differences between the different groups of users are statistically significant or not.
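For the record, a bootstrap sketch in base R might look like the following, with toy step vectors standing in for the two user groups:

```r
set.seed(42)

## Bootstrap sketch: resample each group's daily step counts with
## replacement and build a distribution of mean differences.
## (Toy vectors stand in for active/non-active users' steps.)
active_steps    <- c(11000, 9500, 12000, 8700, 10400)
nonactive_steps <- c(6000, 7200, 5400, 8100, 6600)

boot_diffs <- replicate(2000, {
  mean(sample(active_steps, replace = TRUE)) -
    mean(sample(nonactive_steps, replace = TRUE))
})

## A rough 95% interval on the difference in means; if it excludes
## zero, the gap between groups is unlikely to be chance alone.
quantile(boot_diffs, c(0.025, 0.975))
```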
All in all, this was a fun project and it definitely laid the groundwork for my approach to exploratory data analyses in the future.